Internal State GPOMDP with Trace Filtering

Authors

  • Douglas Aberdeen
  • Jonathan Baxter
  • Peter L. Bartlett
Abstract

GPOMDP is an algorithm for estimating the gradient of the average reward for arbitrary Partially Observable Markov Decision Processes (POMDPs) controlled by parameterized stochastic policies. It applies to purely reactive (memoryless) policies, and to policies that generate actions as a function of finite histories of observations. Motivated by the fact that maintaining a belief state is sufficient for optimal control in POMDPs, this paper extends GPOMDP to parameterized stochastic controllers with internal state. We also generalize the discounting of rewards used by GPOMDP and other RL algorithms to arbitrary IIR and FIR filters, and show how prior knowledge can be used to set the filter taps so as to reduce the variance of the gradient estimates. Several experimental results are presented, including large-scale phoneme recognition.
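As a rough illustration of the abstract's filtering idea, the sketch below replaces GPOMDP's usual exponential eligibility trace with an arbitrary FIR/IIR filter over the score values. Everything here is an assumption for illustration: the function name, the toy bandit problem, and the (b, a) coefficient convention (borrowed from scipy.signal.lfilter) are not the paper's; the paper itself applies the idea to POMDP controllers with internal state, which this sketch omits.

```python
import numpy as np

def filtered_trace_gradient(num_steps, theta, b=(1.0,), a=(1.0, -0.9), seed=0):
    """GPOMDP-style one-pass gradient estimate in which the usual
    exponential eligibility trace  z_t = beta * z_{t-1} + g_t  is
    replaced by a general linear (FIR/IIR) filter over the scores
    g_t = d/dtheta log pi(a_t | theta).

    Filter coefficient convention (as in scipy.signal.lfilter):
        a[0]*z_t = b[0]*g_t + b[1]*g_{t-1} + ...
                   - a[1]*z_{t-1} - a[2]*z_{t-2} - ...
    Classic GPOMDP with discount beta is b=(1,), a=(1, -beta).

    Toy problem (illustrative only, not from the paper): a two-action
    bandit with a sigmoid policy pi(a=1) = sigma(theta) and reward 1
    for action 1, so the true gradient of the average reward is
    sigma(theta) * (1 - sigma(theta)).
    """
    rng = np.random.default_rng(seed)
    b = np.asarray(b, dtype=float)
    a = np.asarray(a, dtype=float)
    g_hist = np.zeros(len(b))              # g_t, g_{t-1}, ... (newest first)
    z_hist = np.zeros(max(len(a) - 1, 1))  # z_{t-1}, z_{t-2}, ...
    grad_est = 0.0
    p = 1.0 / (1.0 + np.exp(-theta))       # pi(a = 1 | theta)
    for t in range(num_steps):
        act = rng.random() < p
        reward = 1.0 if act else 0.0
        g = (1.0 - p) if act else -p       # score of the sampled action
        # push g_t and run one step of the filter to obtain the trace z_t
        g_hist = np.roll(g_hist, 1)
        g_hist[0] = g
        z = (b @ g_hist - a[1:] @ z_hist[:len(a) - 1]) / a[0]
        z_hist = np.roll(z_hist, 1)
        z_hist[0] = z
        # running average of reward * trace (the GPOMDP estimator form)
        grad_est += (reward * z - grad_est) / (t + 1)
    return grad_est
```

With the defaults (the classic trace with beta = 0.9), filtered_trace_gradient(200_000, 0.3) should approach sigma(0.3) * (1 - sigma(0.3)) ≈ 0.245. An FIR trace such as b=(1.0, 0.5, 0.25), a=(1.0,) instead truncates the credit-assignment window to three steps, which is the kind of prior-knowledge filter design the abstract alludes to.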


Similar papers

On Line Electric Power Systems State Estimation Using Kalman Filtering (RESEARCH NOTE)

In this paper, principles of extended Kalman filtering theory are developed and applied to simulated on-line electric power system state estimation in order to trace operating-condition changes through redundant and noisy measurements. Test results on the IEEE 14-bus test system are included. Three case systems are tried; by comparing their results, it is concluded that the pro...
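For reference, the recursion this blurb relies on is the standard extended Kalman filter predict/update cycle; a generic sketch follows (all names are hypothetical and bear no relation to the cited paper's actual implementation).

```python
import numpy as np

def ekf_step(x, P, z, f, F_jac, h, H_jac, Q, R):
    """One predict/update cycle of a textbook extended Kalman filter.
    f, h are the (possibly nonlinear) state-transition and measurement
    functions; F_jac, H_jac return their Jacobians at the given state;
    Q, R are process- and measurement-noise covariances.
    """
    # predict: propagate the state estimate and its covariance
    x_pred = f(x)
    F = F_jac(x)
    P_pred = F @ P @ F.T + Q
    # update: correct with the (redundant, noisy) measurement z
    H = H_jac(x_pred)
    S = H @ P_pred @ H.T + R             # innovation covariance
    K = P_pred @ H.T @ np.linalg.inv(S)  # Kalman gain
    x_new = x_pred + K @ (z - h(x_pred))
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new
```

With identity dynamics f(x) = x and a linear measurement h(x) = Hx, this reduces to the ordinary Kalman filter tracking a drifting operating point through noisy measurements.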


Infinite-Horizon Policy-Gradient Estimation

Gradient-based approaches to direct policy search in reinforcement learning have received much recent attention as a means to solve problems of partial observability and to avoid some of the problems associated with policy degradation in value-function methods. In this paper we introduce GPOMDP, a simulation-based algorithm for generating a biased estimate of the gradient of the average reward ...


Direct Gradient-Based Reinforcement Learning: II. Gradient Ascent Algorithms and Experiments

In [2] we introduced GPOMDP, an algorithm for computing arbitrarily accurate approximations to the performance gradient of parameterized partially observable Markov decision processes (POMDPs). The algorithm's chief advantages are that it requires only a single sample path of the underlying Markov chain, and it uses only one free parameter β ∈ [0, 1), which has a natural interpretation in terms of bia...


Reinforcement Learning in POMDP's via Direct Gradient Ascent

This paper discusses theoretical and experimental aspects of gradient-based approaches to the direct optimization of policy performance in controlled POMDPs. We introduce GPOMDP, a REINFORCE-like algorithm for estimating an approximation to the gradient of the average reward as a function of the parameters of a stochastic policy. The algorithm’s chief advantages are that it requires only a sing...


Robust state estimation in power systems using pre-filtering measurement data

State estimation is the foundation of any control and decision making in power networks. The first requirement for a secure network is a precise and reliable state estimator, so that decisions are based on accurate knowledge of the network status. This paper introduces a new estimator that can detect bad data with few calculations, without the need for repetitions or estimation residual cal...




Publication date: 2007